Multilabel Text Classification Done Right Using Scikit-learn and Stacked Generalization

https://towardsdatascience.com/multilabel-text-classification-done-right-using-scikit-learn-and-stacked-generalization-f5df2defc3b5

2022/06

The data does not appear to be public, so reproducing the results looks difficult

It has two features: problem containing math problems in LaTeX format and tags populated with one or two classes among algebra, combinatorics, geometry, or number theory.

Explanatory variable: the problem statement written in LaTeX notation (problem)

Target variable: the problem's tags (tags)

multilabel (machine learning term)
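To make the multilabel target concrete, here is a minimal sketch of encoding tag lists as a binary indicator matrix with scikit-learn's MultiLabelBinarizer. The sample rows are made up for illustration; only the four class names come from the article.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical samples: each problem carries one or two tags.
tags = [
    ["algebra"],
    ["geometry", "combinatorics"],
    ["number theory", "algebra"],
]

mlb = MultiLabelBinarizer()
# Y has one row per problem and one 0/1 column per class.
Y = mlb.fit_transform(tags)

print(mlb.classes_)  # column order (sorted class names)
print(Y)
```

Each row can contain more than one 1, which is what separates multilabel from ordinary multiclass classification.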

Dataset size: 22790

You will build your model step by step starting from the simplest and adding complexity along the way:

Random prediction

Rule-based prediction

Machine Learning
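For the first rung of that ladder, a random baseline could look like the sketch below. How the article actually draws its random predictions is not recorded in these notes; sampling each label independently at its training frequency is an assumption, and the indicator matrix is made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training indicator matrix (rows: problems, columns: 4 tags).
Y_train = np.array([
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
])

def random_predict(n_samples, label_freq, rng):
    """Draw each label independently with its training frequency."""
    draws = rng.random((n_samples, label_freq.shape[0]))
    return (draws < label_freq).astype(int)

label_freq = Y_train.mean(axis=0)  # share of problems carrying each tag
Y_pred = random_predict(5, label_freq, rng)
print(Y_pred)
```

Any later model should beat the score of this baseline by a clear margin.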

Preprocessing: strip LaTeX notation with regular expressions

recursive regular expressions
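A rough sketch of this kind of cleanup, not the article's actual preprocessing. Python's built-in re module has no recursive patterns (the third-party regex module does), so this version swaps in a different technique: it unwraps nested commands by repeating a non-recursive substitution until nothing changes. Dropping math spans entirely is also an assumption.

```python
import re

# Matches \command{content} where content has no nested braces.
_COMMAND = re.compile(r"\\[a-zA-Z]+\*?\{([^{}]*)\}")

def strip_latex(text: str) -> str:
    """Rough LaTeX cleanup for illustration only."""
    # Drop math delimited by $$...$$ or $...$.
    text = re.sub(r"\$\$.*?\$\$|\$.*?\$", " ", text, flags=re.DOTALL)
    # Drop environment markers such as \begin{itemize} and \end{itemize}.
    text = re.sub(r"\\(?:begin|end)\{[^{}]*\}", " ", text)
    # Unwrap \command{content} -> content, innermost first, until stable.
    while True:
        unwrapped = _COMMAND.sub(r"\1", text)
        if unwrapped == text:
            break
        text = unwrapped
    # Remove leftover bare commands (e.g. \item) and line breaks (\\).
    text = re.sub(r"\\[a-zA-Z]+\*?|\\\\", " ", text)
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()

print(strip_latex(r"Prove that \textbf{\emph{every}} triangle with $a^2+b^2=c^2$ is right."))
```

The loop plays the role that recursion would play in a recursive pattern: each pass peels off one level of nesting.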

scikit-learn: the stratify parameter of train_test_split

train / validation / test = 70% / 15% / 15%
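One way to get a 70 / 15 / 15 split is to call train_test_split twice, stratifying both times. The sketch below uses made-up data and assumes the stratification key is the tag combination joined into a single string; the article's exact key is not recorded in these notes.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical data: 100 problems with three distinct tag combinations.
texts = [f"problem {i}" for i in range(100)]
combos = ["algebra"] * 40 + ["geometry"] * 40 + ["algebra|geometry"] * 20

# Step 1: hold out 30% of the data, keeping combination ratios intact.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    texts, combos, test_size=0.30, stratify=combos, random_state=0
)
# Step 2: split the held-out 30% in half: 15% validation, 15% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
print(Counter(y_train))
```

Stratifying on combinations requires at least two samples per combination in every split step; very rare combinations would need the dedicated multi-label stratification instead.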

Not needed this time, but the article points to "7. Multi-label data stratification"

Rule-based: predict a tag if the text contains specific words
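A minimal sketch of such a rule-based predictor. The keyword lists are hypothetical placeholders; the article's actual word lists are not reproduced here.

```python
# Hypothetical keyword rules, one list per tag.
RULES = {
    "algebra": ["polynomial", "equation"],
    "combinatorics": ["how many ways", "permutation"],
    "geometry": ["triangle", "circle"],
    "number theory": ["prime", "divisible"],
}

def predict_tags(text, rules=RULES):
    """Return every tag whose keyword list has a match in the text."""
    lowered = text.lower()
    return [
        tag
        for tag, words in rules.items()
        if any(word in lowered for word in words)
    ]

print(predict_tags("Find all primes p such that the triangle has integer sides."))
```

Because the match is a plain substring test, "prime" also fires on "primes"; a text with no keyword gets an empty list, so a fallback tag may be needed in practice.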

Machine learning: 2 TF-IDF ngram ranges × 7 algorithms
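The comparison can be sketched as a loop over every (ngram range, algorithm) pair. Everything specific below is an assumption: the two ngram ranges, the two stand-in algorithms (the notes do not list the article's seven), the one-vs-rest wrapper, the micro-averaged F1, and the toy data.

```python
from itertools import product

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy data, made up for illustration.
train_texts = [
    "solve the polynomial equation",
    "roots of the quadratic equation",
    "area of the triangle inside the circle",
    "angle of the triangle",
    "prime numbers divisible by three",
    "prime factor of the polynomial equation",
]
train_tags = [
    ["algebra"], ["algebra"], ["geometry"], ["geometry"],
    ["number theory"], ["algebra", "number theory"],
]
val_texts = ["the triangle and the circle", "a prime divisible by seven"]
val_tags = [["geometry"], ["number theory"]]

mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(train_tags)
Y_val = mlb.transform(val_tags)

ngram_ranges = [(1, 1), (1, 2)]  # assumed ranges
algorithms = {                   # stand-ins for the article's seven
    "logreg": LogisticRegression(max_iter=1000),
    "linear_svc": LinearSVC(),
}

results = {}
for ngram_range, (name, clf) in product(ngram_ranges, algorithms.items()):
    # One binary classifier per tag on top of TF-IDF features.
    model = make_pipeline(
        TfidfVectorizer(ngram_range=ngram_range),
        OneVsRestClassifier(clf),
    )
    model.fit(train_texts, Y_train)
    Y_pred = model.predict(val_texts)
    results[(ngram_range, name)] = f1_score(
        Y_val, Y_pred, average="micro", zero_division=0
    )

for key, score in sorted(results.items(), key=lambda kv: -kv[1]):
    print(key, round(score, 3))
```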

Stack the best 3 (F1 0.85)
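A sketch of stacking three base models with scikit-learn's StackingClassifier. The choice of base models, the logistic-regression meta-learner, cv=2, and fitting one stack per tag through MultiOutputClassifier are all assumptions for illustration, not necessarily how the article wires it up; the data is made up.

```python
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy data, made up for illustration.
texts = [
    "solve the polynomial equation",
    "roots of the quadratic equation",
    "factor the polynomial",
    "area of the triangle inside the circle",
    "angle of the triangle",
    "radius of the circle",
    "prime numbers divisible by three",
    "smallest prime divisible by seven",
    "prime roots of the polynomial equation",
]
tags = [
    ["algebra"], ["algebra"], ["algebra"],
    ["geometry"], ["geometry"], ["geometry"],
    ["number theory"], ["number theory"],
    ["algebra", "number theory"],
]
Y = MultiLabelBinarizer().fit_transform(tags)

# Three base models feed their out-of-fold predictions to a meta-learner.
stack = StackingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("svc", LinearSVC()),
        ("nb", MultinomialNB()),
    ],
    final_estimator=LogisticRegression(),
    cv=2,  # small only because the toy data is tiny
)

# MultiOutputClassifier fits one independent stack per tag column.
stack_model = make_pipeline(TfidfVectorizer(), MultiOutputClassifier(stack))
stack_model.fit(texts, Y)

Y_pred = stack_model.predict(["the circle and the triangle", "a prime polynomial"])
print(Y_pred)
```

The meta-learner only sees the base models' cross-validated outputs, which is what keeps stacked generalization from simply memorizing the training set.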

You also notice class imbalance and know how to treat it using the class_weight parameter in scikit-learn.

TODO class_weight
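As a starting point for this TODO, a small sketch of what class_weight="balanced" does. scikit-learn weights each class by n_samples / (n_classes * count), so rare classes get larger weights. The toy data is made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced toy target: 8 negatives, 2 positives.
y = np.array([0] * 8 + [1] * 2)
X = np.arange(10, dtype=float).reshape(-1, 1)

# Weights that "balanced" would use: 10 / (2 * 8) and 10 / (2 * 2).
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)
print(weights)

# Passing class_weight="balanced" applies the same weights during fitting.
clf = LogisticRegression(class_weight="balanced")
clf.fit(X, y)
```

Many scikit-learn classifiers accept the same parameter, and an explicit dict such as {0: 1.0, 1: 4.0} can be passed instead of "balanced".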

TODO The approach of building a Switcher for grid search looks like a useful reference
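A sketch of the Switcher idea as I understand it: a thin wrapper whose only parameter is the wrapped estimator, so GridSearchCV can swap whole algorithms (and their own hyperparameters) in one search. The class below is my own minimal version, not the article's code; the grid values and toy data are assumptions.

```python
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


class Switcher(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper that exposes the estimator as a grid parameter."""

    def __init__(self, estimator=None):
        self.estimator = estimator

    def fit(self, X, y):
        # Fit a fresh copy so the object listed in the grid stays unfitted.
        self.estimator_ = clone(self.estimator).fit(X, y)
        self.classes_ = self.estimator_.classes_
        return self

    def predict(self, X):
        return self.estimator_.predict(X)


# Toy single-label data, made up for illustration.
texts = [
    "solve the polynomial equation", "roots of the quadratic equation",
    "factor the polynomial", "area of the triangle",
    "radius of the circle", "angle of the triangle",
]
labels = ["algebra", "algebra", "algebra", "geometry", "geometry", "geometry"]

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", Switcher())])

# One dict per algorithm, each with its own hyperparameters.
param_grid = [
    {
        "tfidf__ngram_range": [(1, 1), (1, 2)],
        "clf__estimator": [LogisticRegression(max_iter=1000)],
        "clf__estimator__C": [0.1, 1.0],
    },
    {
        "tfidf__ngram_range": [(1, 1), (1, 2)],
        "clf__estimator": [MultinomialNB()],
        "clf__estimator__alpha": [0.1, 1.0],
    },
]

search = GridSearchCV(
    pipe, param_grid, scoring="f1_macro", cv=StratifiedKFold(n_splits=2)
)
search.fit(texts, labels)
print(search.best_params_)
```

The list-of-dicts grid keeps each algorithm's hyperparameters separate, so only valid combinations are tried.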

Repository https://github.com/dwiuzila/tagolym

The data is managed with dvc, but it is not on external storage, so it probably cannot be obtained for reproduction